An Unsupervised Machine Learning Approach to Segmentation of Clinician-Entered Free Text
نویسندگان
چکیده
Natural language processing, an important tool in biomedicine, fails without successful segmentation of words and sentences. Tokenization is a form of segmentation that identifies boundaries separating semantic units, for example words, dates, numbers and symbols, within a text. We sought to construct a highly generalizeable tokenization algorithm with no prior knowledge of characters or their function, based solely on the inherent statistical properties of token and sentence boundaries. Tokenizing clinician-entered free text, we achieved precision and recall of 92% and 93%, respectively compared to a whitespace token boundary detection algorithm. We classified over 80% of punctuation characters correctly, based on manual disambiguation with high inter-rater agreement (kappa=0.916). Our algorithm effectively discovered properties of whitespace and punctuation in the corpus without prior knowledge of either. Given the dynamic nature of biomedical language, and the variety of distinct sublanguages used, the effectiveness and generalizability of our novel tokenization algorithm make it a valuable tool.
منابع مشابه
Unsupervised Learning with Term Clustering for Thematic Segmentation of Texts
In this paper we introduce a machine learning approach for automatic text segmentation. Our text segmenter clusters text-segments containing similar concepts. It first discovers the different concepts present in a text, each concept being defined as a set of representative terms. After that the text is partitioned into coherent paragraphs using a clustering technique based on the Classification...
متن کاملAdaptive scene-dependent filters in online learning environments
In this paper we propose the Adaptive Scene Dependent Filters (ASDF) to enhance the online learning capabilities of an object recognition system in real-world scenes. The ASDF method proposed extends the idea of unsupervised segmentation to a flexible, highly dynamic image segmentation architecture. We combine unsupervised segmentation to define coherent groups of pixels with a recombination st...
متن کاملDisambiguating proteins, genes, and RNA in text: a machine learning approach
We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways for defining contextual features for disambiguation are examined, and a fully unsupervised manner for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecu...
متن کاملPresentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
متن کاملPresentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- AMIA ... Annual Symposium proceedings. AMIA Symposium
دوره شماره
صفحات -
تاریخ انتشار 2007